[Feature] Semantics support and remote atomic add #48

Open
tzj-fxz wants to merge 28 commits into main from tzj-dev

Conversation


@tzj-fxz tzj-fxz commented Feb 4, 2026

  • Add semantics and scope support for wait operation
  • Add remote atomic add operation
  • Add example of manually routing alltoall intranode

WIP

  • Consistent scope identifiers for memory operation
  • Optimize routing examples

Summary by CodeRabbit

  • New Features

    • Added intra‑node distributed all‑to‑all examples (including 2D torus XY routing) with verification, CLI launch, and optional benchmarking; added remote atomic‑add operation.
  • API Changes

    • Wait and fence APIs now accept explicit memory scope and semantic parameters; legacy single‑arg wait variant removed.
    • Barrier API gains a need_fence flag to control automatic fencing.
  • Refactor

    • Replaced string‑based scope/semantic parameters with enum types for clearer, safer calls.
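The string-to-enum refactor can be pictured with a small standalone sketch. The enum names and integer values follow the unified mapping described later in this review (scope {0: cta, 1: cluster, 2: gpu, 3: sys}; semantic {0: weak, 1: volatile, 2: relaxed, 3: acquire, 4: release, 5: acq_rel}), but the wait shim below is illustrative only, not the actual tilelang API:

```python
from enum import IntEnum

class MemoryScope(IntEnum):
    CTA = 0
    CLUSTER = 1
    GPU = 2
    SYSTEM = 3

class MemorySemantic(IntEnum):
    WEAK = 0
    VOLATILE = 1
    RELAXED = 2
    ACQUIRE = 3
    RELEASE = 4
    ACQ_REL = 5

def wait(barrier, expected, scope=MemoryScope.SYSTEM, semantic=MemorySemantic.ACQUIRE):
    # Unlike raw "sys"/"acquire" strings, a typo'd enum member fails at
    # attribute lookup; the numeric values feed the lowered intrinsic.
    return ("tl.wait", barrier, expected, int(scope), int(semantic))

call = wait("bar0", 1, scope=MemoryScope.GPU, semantic=MemorySemantic.RELAXED)
print(call)  # ('tl.wait', 'bar0', 1, 2, 2)
```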


coderabbitai bot commented Feb 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds enum-based memory scopes/semantics, expands wait/fence/atomic primitives to be scope/semantic-aware (including cross-PE waits), introduces a remote atomic-add TileLang operator with PTX lowering, updates CUDA templates/codegen and wait/fence signatures, and adds two intra-node all‑to‑all example scripts.

Changes

Cohort / File(s): Summary

  • Distributed all‑to‑all examples (examples/distributed/intranode/example_alltoall.py, examples/distributed/intranode/example_alltoall_route2x4.py): Add two intra-node all‑to‑all example scripts, a tiled alltoall and a torus XY‑routing multi‑phase alltoall, with TileLang kernels, per-rank launchers, multiprocessing spawn, PyTorch reference verification, and optional benchmarking/timing.
  • Remote atomic operations (src/op/remote_copy.h, src/op/remote_copy.cc): Add AtomAddRemoteOpNode/AtomAddRemoteOp types, registration, and reflection; implement lowering to PTX atom-add naming that encodes semantic/scope; update semantic/scope mappings and bounds checks used during lowering.
  • Synchronization core & operators (src/op/sync.h, src/op/sync.cc): Remove the wait_eq builtin; add scope and semantic fields to WaitOpNode; change WaitOp registration arity (4→6) and BarrierBlocksOp arity (1→2); parse/validate extra args and forward scope/semantic in lowering.
  • CUDA templates (atomics / ldst / sync) (src/tl_templates/cuda/atomic.h, src/tl_templates/cuda/ldst.h, src/tl_templates/cuda/sync.h): Expand the Semantic and Scope enums (add RELAXED, ACQ_REL, CLUSTER); change PTX atomic helpers to accept raw addresses; introduce SyncScope/SyncSemantic enums and semantic-aware memory_fence_/ld/wait helpers; expose wait/memory_fence APIs with semantic parameters.
  • CUDA codegen (src/target/codegen_cuda.cc): Remove the special-case emission for wait_eq; forward the fence semantic argument when emitting tl::memory_fence_* calls (include the first arg in the emitted call).
  • Language layer / Python API (tilelang/language/utils.py, tilelang/language/builtin.py, tilelang/language/distributed/common.py): Add MemoryScope and MemorySemantic enums; update builtin wrappers (ld/st/atom_add/fence) to accept enums and pass numeric values; add atom_add_remote(); extend distributed wait_* helpers to accept peer/scope/semantic and forward them to tl.wait.
  • Transform update (src/transform/warp_specialized_rewriter.cc): Replace the wait_eq() check with WaitOp::Get() in warp role promotion logic to reflect the WaitOp representation change.

Sequence Diagram(s)

sequenceDiagram
  participant Launcher as Launcher (per-rank)
  participant TLKernel as TileLang Kernel
  participant CUDA as CUDA/PTX
  participant RemotePE as Remote PE memory
  participant Verifier as PyTorch verifier

  Launcher->>TLKernel: allocate buffers, init barriers/signals
  Launcher->>TLKernel: launch kernel (alltoall / torus)
  TLKernel->>CUDA: perform tiled copies and routing phases
  TLKernel->>CUDA: call atom_add_remote(addr,val,sem,scope)
  CUDA->>RemotePE: execute tl::ptx_atom_add_<sem>_<scope>(addr,val)
  RemotePE-->>CUDA: memory updated / ack
  TLKernel->>TLKernel: wait/barrier with scope/semantic
  TLKernel-->>Launcher: kernel completes
  Launcher->>Verifier: run PyTorch all_to_all_single and compare results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested reviewers

  • chengyupku

Poem

"I'm a hopping rabbit, coding speedy threads,
I stamp semantics where synchronization treads.
Atoms leap remote and fences hum in tune,
All‑to‑all parcels dance beneath the moon.
Hooray — data routed, now I munch a rune!"

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title '[Feature] Semantics support and remote atomic add' accurately summarizes the main changes, which include adding semantics/scope support for wait operations, implementing remote atomic add operations, and providing examples.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.


github-actions bot commented Feb 4, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@tzj-fxz tzj-fxz marked this pull request as ready for review February 5, 2026 08:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/tl_templates/cuda/atomic.h (1)

244-281: ⚠️ Potential issue | 🟠 Major

Replace unsigned long with uintptr_t for device address parameters.

unsigned long is 32‑bit on some platforms (Windows, 32‑bit systems), risking silent truncation of 64‑bit device addresses. Use uintptr_t to maintain pointer‑width correctness across all platforms. This applies to all four sys-scope atomic functions: ptx_atom_add_relaxed_sys, ptx_atom_add_acquire_sys, ptx_atom_add_release_sys, and ptx_atom_add_acq_rel_sys.

🔧 Suggested fix
+#include <cstdint>
 ...
-TL_DEVICE uint32_t ptx_atom_add_relaxed_sys(unsigned long addr,
+TL_DEVICE uint32_t ptx_atom_add_relaxed_sys(uintptr_t addr,
                                             uint32_t value) {
   uint32_t ret;
   const uint32_t *ptr = reinterpret_cast<const uint32_t *>(addr);
   asm volatile("atom.add.relaxed.sys.global.u32 %0, [%1], %2;\n"
                : "=r"(ret)
                : "l"(ptr), "r"(value));
   return ret;
 }
 
-TL_DEVICE uint32_t ptx_atom_add_acquire_sys(unsigned long addr,
+TL_DEVICE uint32_t ptx_atom_add_acquire_sys(uintptr_t addr,
                                             uint32_t value) {
   uint32_t ret;
   const uint32_t *ptr = reinterpret_cast<const uint32_t *>(addr);
   asm volatile("atom.add.acquire.sys.global.u32 %0, [%1], %2;\n"
                : "=r"(ret)
                : "l"(ptr), "r"(value));
   return ret;
 }
 
-TL_DEVICE uint32_t ptx_atom_add_release_sys(unsigned long addr,
+TL_DEVICE uint32_t ptx_atom_add_release_sys(uintptr_t addr,
                                             uint32_t value) {
   uint32_t ret;
   const uint32_t *ptr = reinterpret_cast<const uint32_t *>(addr);
   asm volatile("atom.add.release.sys.global.u32 %0, [%1], %2;\n"
                : "=r"(ret)
                : "l"(ptr), "r"(value));
   return ret;
 }
 
-TL_DEVICE uint32_t ptx_atom_add_acq_rel_sys(unsigned long addr,
+TL_DEVICE uint32_t ptx_atom_add_acq_rel_sys(uintptr_t addr,
                                             uint32_t value) {
   uint32_t ret;
   const uint32_t *ptr = reinterpret_cast<const uint32_t *>(addr);
   asm volatile("atom.add.acq_rel.sys.global.u32 %0, [%1], %2;\n"
                : "=r"(ret)
                : "l"(ptr), "r"(value));
   return ret;
 }
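For context, the width mismatch behind this comment can be observed from Python with ctypes: on LP64 Linux both types are 8 bytes, while on LLP64 64-bit Windows unsigned long is 4 bytes even though pointers are 8.

```python
import ctypes

# c_void_p always matches the platform pointer width (what uintptr_t guarantees);
# c_ulong follows the data model and shrinks to 4 bytes under LLP64 (64-bit Windows).
print("unsigned long:", ctypes.sizeof(ctypes.c_ulong), "bytes")
print("pointer:      ", ctypes.sizeof(ctypes.c_void_p), "bytes")

# A pointer stored in unsigned long is silently truncated whenever this is False:
fits = ctypes.sizeof(ctypes.c_ulong) >= ctypes.sizeof(ctypes.c_void_p)
print("pointers fit in unsigned long:", fits)
```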
src/op/sync.h (1)

129-171: ⚠️ Potential issue | 🔴 Critical

Add need_fence to SEqualReduce, SHashReduce, and RegisterReflection.

The need_fence field (line 135) is used in the Lower method to affect code generation but is omitted from equality, hashing, and reflection checks. This causes two BarrierBlocksOpNode instances with different need_fence values to compare as equal and hash identically, creating a correctness issue for CSE and other IR-based optimizations.

Required changes

Add need_fence to:

  • RegisterReflection: .def_ro("need_fence", &BarrierBlocksOpNode::need_fence)
  • SEqualReduce: Include equal(need_fence, other->need_fence) in the return statement
  • SHashReduce: Add hash_reduce(need_fence);
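The hazard can be demonstrated with a small Python stand-in (this is not the TVM reflection API; dataclasses' compare=False plays the role of the omitted SEqualReduce/SHashReduce entries):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BarrierBlocksOp:
    barrier_id: int
    # compare=False reproduces the bug: need_fence is left out of __eq__/__hash__
    need_fence: bool = field(default=True, compare=False)

a = BarrierBlocksOp(0, need_fence=True)
b = BarrierBlocksOp(0, need_fence=False)
assert a == b              # different codegen behavior, yet structurally "equal"
assert hash(a) == hash(b)  # a CSE pass keyed on eq/hash would merge the two ops
```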
🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Line 42: The kernel signature currently includes an unused parameter barrier;
update the kernel or its invocation to remove the unused parameter or use it for
synchronization: either delete barrier from the kernel signature and also stop
allocating/passing it in run_torus_alltoall, or add explicit barrier
synchronization inside the kernel (referencing the kernel function where
barrier: T.Tensor((PE_num), "int32") is declared) and ensure run_torus_alltoall
continues to allocate and pass the barrier tensor so the symbol names match
(update both the kernel definition and the call sites in run_torus_alltoall).
- Around line 279-288: The script does not validate that X * Y equals PE_num
before spawning processes, which can cause incorrect torus topology; after
parsing args in the main block (the argparse code that sets args.M, args.N,
args.PE_num, args.X, args.Y) add a check that args.X * args.Y == args.PE_num and
raise a clear error (ValueError or argparse.ArgumentTypeError) or call
parser.error with a descriptive message if the equality fails so
run_torus_alltoall is only invoked with a consistent torus configuration.

In `@examples/distributed/intranode/example_alltoall.py`:
- Around line 73-75: The bandwidth formula is computing elements/sec but not
bytes/sec; update the print calculation to multiply by the element size (in
bytes) — e.g., use dtype_itemsize = np.dtype(tensor_dtype).itemsize or hard-code
2 for float16 — then compute total_bytes = 2 * PE_num * M * N * dtype_itemsize
and compute bandwidth_gbps = total_bytes / (elapsed_time * 1e6) (use the same
elapsed_time units as the printed ms) and print bandwidth_gbps; update the
expression where local_rank, PE_num, M, N, elapsed_time are used to include this
element-size factor.
- Around line 12-37: The kernel declares a barrier parameter that is never used
in main; either remove it from the kernel signature (and corresponding callers)
or implement inter-rank synchronization using that barrier instead of relying
solely on T.fence_sys. If you choose to use the barrier, after T.put_block and
T.fence_sys update the barrier tensor to signal this rank’s completion and then
wait/spin until all PE_num entries are set (i.e., implement a full all-ranks
barrier using the barrier buffer), referencing the symbols main, barrier,
T.put_block and T.fence_sys; if you choose removal, delete the barrier parameter
from main and all call sites.

In `@src/op/remote_copy.cc`:
- Around line 422-431: The code constructs a function name using sem_str[sem]
and scope_str[scope] without validating indices, which can lead to buffer
overreads; add bounds checks for sem (<6) and scope (<4) before indexing
(similar to StOpNode::Lower and LdOpNode::Lower), and handle out-of-range values
by selecting a safe default string or emitting an error/log and returning early
so the use in building "tl::ptx_atom_add_<sem>_<scope>" cannot index past
sem_str or scope_str.

In `@src/op/remote_copy.h`:
- Around line 327-328: Update the field comments for the "scope" and "sem"
members to match the actual enum/mapping used in the implementation: describe
scope as "{0: cta, 1: cluster, 2: gpu, 3: sys}" and describe sem as "{0: weak,
1: volatile, 2: relaxed, 3: acquire, 4: release, 5: acq_rel}" so the
documentation for the int members scope and sem in remote_copy.h matches the
implemented semantics.

In `@tilelang/language/builtin.py`:
- Around line 839-854: The is_remote check in the function incorrectly treats a
plain Python int dst_pe == -1 as remote because it only checks
isinstance(dst_pe, tir.IntImm); update the logic in the is_remote computation
(the variable named is_remote) to also consider a plain int -1 as local—e.g.,
treat dst_pe == -1 the same as tir.IntImm with value == -1 or normalize dst_pe
via getattr(dst_pe, "value", dst_pe) before comparison—so the remote branch (the
code that builds local_rank, local_base_ptr, offset_to_base, remote_ptr and
calls tir.call_extern with remote_ptr) is only taken when dst_pe truly indicates
a remote PE.
- Around line 737-738: Update the incorrect docstring on the function atom_add:
replace the copy-pasted PTX cp.async barrier description with a concise
description stating that atom_add performs an atomic add operation (e.g.,
performs an atomic addition of a value to a target and returns the previous
value or the new value per implementation), mention relevant semantics such as
data type(s) supported and whether the operation is relaxed/ordered or maps to a
specific PTX atomic instruction, and ensure the docstring matches the function
signature and behavior in atom_add.
- Around line 733-747: The atom_add function currently uses direct dict lookups
for scope_str and sem_str which will raise KeyError for unsupported MemoryScope
or MemorySemantic values; update atom_add to explicitly validate inputs before
using them (check the provided scope against allowed values {MemoryScope.GPU,
MemoryScope.SYSTEM} and sem against {MemorySemantic.RELAXED,
MemorySemantic.ACQUIRE, MemorySemantic.RELEASE, MemorySemantic.ACQ_REL}), and if
invalid raise a clear ValueError mentioning the parameter name, the unsupported
value, and the list of supported options; then proceed to build
scope_str/sem_str and call tir.call_intrin(address_of(barrier), value, sem_str,
scope_str) as before.
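The dst_pe normalization suggested for is_remote above can be sketched without TVM; FakeIntImm below is a hypothetical stand-in for any tir.IntImm-like object carrying its constant in .value:

```python
def is_remote_pe(dst_pe):
    # Unwrap IntImm-style wrappers; plain ints pass through unchanged.
    value = getattr(dst_pe, "value", dst_pe)
    # Only a concrete -1 means "local"; symbolic expressions stay remote.
    return not (isinstance(value, int) and value == -1)

class FakeIntImm:
    def __init__(self, value):
        self.value = value

assert is_remote_pe(-1) is False              # plain Python int: local
assert is_remote_pe(FakeIntImm(-1)) is False  # wrapped immediate: local
assert is_remote_pe(3) is True
assert is_remote_pe(FakeIntImm(0)) is True
```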
🧹 Nitpick comments (10)
src/op/remote_copy.cc (1)

276-281: Consider extracting duplicated string arrays to shared constants.

The sem_str and scope_str arrays are duplicated in StOpNode::Lower, LdOpNode::Lower, and AtomAddRemoteOpNode::Lower. This creates a maintenance burden — if the enum mappings change, all three locations must be updated.

Also applies to: 350-355, 424-428

examples/distributed/intranode/example_alltoall.py (1)

83-83: Consider using English comments for consistency.

The comment # 对比结果 (meaning "compare results") is in Chinese. For broader accessibility and consistency, consider using English comments.

✏️ Suggested change
-    # 对比结果
+    # Compare results
examples/distributed/intranode/example_alltoall_route2x4.py (2)

27-27: Unused function argument block_N.

The block_N parameter is passed to the function but never used. Consider removing it if it's not needed.


230-231: Large intermediate buffer allocation - consider documenting memory requirements.

The buffer_transfer tensor has shape (PE_num, PE_num, M, N). With default parameters (PE_num=8, M=8192, N=7168, float16), this requires approximately 7.5 GB per rank. Combined with src, dst, and other tensors, total memory per GPU could exceed 15 GB.

Consider adding a comment or documentation noting the memory requirements, or adding a check to warn users if the allocation might exceed available GPU memory.
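A quick back-of-the-envelope check of the figures quoted above (float16 assumed at 2 bytes per element, default shapes from the script):

```python
PE_num, M, N = 8, 8192, 7168
itemsize = 2  # sizeof(float16)

buffer_transfer = PE_num * PE_num * M * N * itemsize  # shape (PE_num, PE_num, M, N)
src_dst = 2 * PE_num * M * N * itemsize               # src and dst tensors

print(f"buffer_transfer: {buffer_transfer / 1e9:.2f} GB")  # 7.52 GB
print(f"with src + dst:  {(buffer_transfer + src_dst) / 1e9:.2f} GB")
```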

src/tl_templates/cuda/ldst.h (1)

5-7: CLUSTER scope has no ld/st implementations.

Scope::CLUSTER is now public, but there are no TL_ST_IMPL/TL_LD_IMPL entries for it, so any use will fall into the unsupported static_assert. If CLUSTER isn’t intended for tl::ld/st yet, consider an explicit guard to make the limitation clearer; if it is intended, add the PTX mappings.

src/tl_templates/cuda/sync.h (1)

239-323: CTA/CLUSTER scopes and non‑acquire semantics collapse to GPU acquire.

ld_wait_generic treats any non‑SYSTEM scope as GPU, and ld_wait_{gpu,sys} only distinguish RELAXED/VOLATILE; all other semantics fall back to acquire. If CTA/CLUSTER or WEAK/RELEASE/ACQ_REL/SC are intended to be distinct, add explicit cases or assert on unsupported values to avoid silent mismatch.

tilelang/language/distributed/common.py (2)

126-131: Type hint for peer parameter is inaccurate.

The type hint peer: PrimExpr | None = -1 is misleading—the default -1 is an int, not None. This pattern repeats across all wait_* functions (lines 144-148, 162-166, 180-184, 198-202, 216-220).

Proposed fix for wait_eq (apply similar changes to other wait_* functions)
 def wait_eq(barrier: PrimExpr,
             expected: PrimExpr,
-            peer: PrimExpr | None = -1,
+            peer: PrimExpr | int = -1,
             scope: MemoryScope = MemoryScope.SYSTEM,
             semantic: MemorySemantic = MemorySemantic.ACQUIRE):

131-132: Address TODO comment about semantic differences.

The TODO at line 132 indicates there are semantic differences between wait_eq and the other wait functions that need clarification. This should be addressed to ensure proper documentation.

Would you like me to help draft documentation clarifying the semantic differences between wait_eq and other wait functions, or open an issue to track this?

tilelang/language/builtin.py (2)

827-833: Duplicate scope/sem mapping logic—consider consolidating.

This scope/semantic mapping (lines 827-833) duplicates the code in atom_add (lines 739-745). Both also lack validation for unsupported values. Consider extracting a helper function to avoid duplication and centralize validation.

Proposed helper functions
def _scope_to_str(scope: MemoryScope) -> str:
    """Convert MemoryScope enum to PTX string representation."""
    mapping = {MemoryScope.GPU: "gpu", MemoryScope.SYSTEM: "sys"}
    if scope not in mapping:
        raise ValueError(f"Unsupported scope for atomic operations: {scope}")
    return mapping[scope]

def _sem_to_str(sem: MemorySemantic) -> str:
    """Convert MemorySemantic enum to PTX string representation."""
    mapping = {
        MemorySemantic.RELAXED: "relaxed",
        MemorySemantic.ACQUIRE: "acquire",
        MemorySemantic.RELEASE: "release",
        MemorySemantic.ACQ_REL: "acq_rel"
    }
    if sem not in mapping:
        raise ValueError(f"Unsupported semantic for atomic operations: {sem}")
    return mapping[sem]

807-813: Type hint for dst_pe is inaccurate.

The type hint tir.PrimExpr | tir.IntImm | None = -1 has issues:

  1. tir.IntImm is a subclass of tir.PrimExpr, making it redundant
  2. The default -1 is a Python int, not covered by the type hint
  3. None is in the type hint but the default is -1, not None
Proposed fix
 def atom_add_remote(
     dst: PrimExpr,
     value: PrimExpr,
     scope: MemoryScope = MemoryScope.SYSTEM,
     sem: MemorySemantic = MemorySemantic.RELAXED,
-    dst_pe: tir.PrimExpr | tir.IntImm | None = -1,
+    dst_pe: tir.PrimExpr | int = -1,
 ):

Comment on lines 279 to 288
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--M", type=int, default=8192)
    parser.add_argument("--N", type=int, default=7168)
    parser.add_argument("--PE_num", type=int, default=8)
    parser.add_argument("--X", type=int, default=2)
    parser.add_argument("--Y", type=int, default=4)
    args = parser.parse_args()

    torch.multiprocessing.spawn(run_torus_alltoall, args=(args.PE_num, args), nprocs=args.PE_num)

⚠️ Potential issue | 🟡 Minor

Add validation that X * Y == PE_num.

The torus topology requires X * Y == PE_num, but this is not validated. If a user provides inconsistent values (e.g., --PE_num 8 --X 3 --Y 3), the kernel would produce incorrect results or fail unexpectedly.

🛡️ Proposed fix
     args = parser.parse_args()
+    
+    if args.X * args.Y != args.PE_num:
+        parser.error(f"X * Y must equal PE_num (got {args.X} * {args.Y} = {args.X * args.Y}, PE_num = {args.PE_num})")

     torch.multiprocessing.spawn(run_torus_alltoall, args=(args.PE_num, args), nprocs=args.PE_num)

Comment on lines +12 to +37
@T.prim_func
def main(
        src: T.Tensor((PE_num * M, N), "float16"),
        dst: T.Tensor((PE_num * M, N), "float16"),
        barrier: T.Tensor((PE_num), "int32"),
):
    # Currently not support tiled copy
    with T.Kernel(
            PE_num, T.ceildiv(M, block_M), T.ceildiv(N, block_N),
            threads=threads) as (bx, by, bz):
        rank = T.alloc_local([1], "int32")
        num_ranks = T.alloc_local([1], "int32")

        dst_rank = bx
        rank[0] = T.get_rank()
        num_ranks[0] = T.get_num_ranks()

        T.put_block(
            src=T.address_of(src[dst_rank * M + by * block_M, 0]),
            dst=T.address_of(dst[rank[0] * M + by * block_M, 0]),
            size=block_M * block_N,
            dst_pe=dst_rank,
        )
        T.fence_sys(sem=T.MemorySemantic.RELEASE)

return main

⚠️ Potential issue | 🟡 Minor

Unused barrier parameter in kernel signature.

The barrier parameter is declared in the kernel signature but never used within the kernel body. This could indicate either:

  1. A missing synchronization step that should use this barrier
  2. An unnecessary parameter that should be removed

Given this is an all-to-all operation, typically a barrier or synchronization mechanism is needed to ensure all ranks have completed their transfers before the kernel returns. Currently, only T.fence_sys is called which provides memory ordering but not inter-rank synchronization.

💡 Suggested fix: Either use the barrier or remove it

Option 1 - Add barrier synchronization:

             T.put_block(
                 src=T.address_of(src[dst_rank * M + by * block_M, 0]),
                 dst=T.address_of(dst[rank[0] * M + by * block_M, 0]),
                 size=block_M * block_N,
                 dst_pe=dst_rank,
             )
             T.fence_sys(sem=T.MemorySemantic.RELEASE)
+            T.barrier_blocks(barrier)

     return main

Option 2 - Remove unused parameter:

     @T.prim_func
     def main(
             src: T.Tensor((PE_num * M, N), "float16"),
             dst: T.Tensor((PE_num * M, N), "float16"),
-            barrier: T.Tensor((PE_num), "int32"),
     ):
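Option 1's device-side pattern (signal your own barrier slot, then spin until every slot is set) can be sketched host-side with threads as an analogy; this is illustrative only, not TileLang code:

```python
import threading

PE_num = 4
barrier_buf = [0] * PE_num  # stands in for the barrier tensor, one slot per PE

def rank_body(rank):
    # (device code would do put_block + a release fence here)
    barrier_buf[rank] = 1        # signal this rank's completion
    while not all(barrier_buf):  # spin until all PEs have signaled
        pass                     # on the GPU this would be an acquire-scoped wait

threads = [threading.Thread(target=rank_body, args=(r,)) for r in range(PE_num)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert barrier_buf == [1] * PE_num  # no thread escaped before all signaled
```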
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 16-16: Unused function argument: barrier

(ARG001)


[warning] 21-21: Unpacked variable bz is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


Comment on lines 73 to 75
    print(
        f"Rank {local_rank} Kernel execution time: {elapsed_time:.3f} ms, Bandwidth: {2 * PE_num * M * N / (elapsed_time * 1e6):.3f} GB/s"
    )

⚠️ Potential issue | 🟡 Minor

Bandwidth calculation may be incorrect - missing dtype size factor.

The bandwidth formula 2 * PE_num * M * N / (elapsed_time * 1e6) calculates based on element count, but GB/s should be based on bytes. For float16 (2 bytes per element), the formula should multiply by the element size:

🔧 Proposed fix
     print(
-        f"Rank {local_rank} Kernel execution time: {elapsed_time:.3f} ms, Bandwidth: {2 * PE_num * M * N / (elapsed_time * 1e6):.3f} GB/s"
+        f"Rank {local_rank} Kernel execution time: {elapsed_time:.3f} ms, Bandwidth: {2 * PE_num * M * N * 2 / (elapsed_time * 1e6):.3f} GB/s"
     )

Alternatively, for clarity:

total_bytes = 2 * PE_num * M * N * 2  # bidirectional * elements * sizeof(float16)
bandwidth_gbps = total_bytes / (elapsed_time * 1e6)
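Plugging the script's default shapes into the corrected formula gives a concrete sanity check (elapsed_time_ms here is a made-up example value, not a measurement):

```python
PE_num, M, N = 8, 8192, 7168
itemsize = 2            # sizeof(float16)
elapsed_time_ms = 5.0   # hypothetical kernel time

total_bytes = 2 * PE_num * M * N * itemsize  # bidirectional * elements * bytes
bandwidth_gbps = total_bytes / (elapsed_time_ms * 1e6)  # ms * 1e6 -> GB/s
print(f"{bandwidth_gbps:.3f} GB/s")  # 375.810 GB/s
```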

Comment on lines +422 to +431
  // Map integers to semantic literal strings for PTX atom instruction
  // Unified Mapping: 2: relaxed, 3: acquire, 4: release, 5: acq_rel
  const char *sem_str[] = {"weak",    "volatile", "relaxed",
                           "acquire", "release",  "acq_rel"};
  const char *scope_str[] = {
      "cta", "cluster", "gpu",
      "sys"}; // Unified: 2: gpu, 3: system (mapped below)

  // Build function name: tl::ptx_atom_add_<sem>_<scope>
  ss << "tl::ptx_atom_add_" << sem_str[sem] << "_" << scope_str[scope];

⚠️ Potential issue | 🔴 Critical

Missing bounds checks for sem and scope indices.

Unlike StOpNode::Lower (line 283-284) and LdOpNode::Lower (line 357-358), this method does not validate sem and scope before using them as array indices. If sem >= 6 or scope >= 4, this causes undefined behavior (buffer overread).

🐛 Proposed fix to add bounds checks
   const char *sem_str[] = {"weak",    "volatile", "relaxed",
                            "acquire", "release",  "acq_rel"};
   const char *scope_str[] = {
       "cta", "cluster", "gpu",
       "sys"}; // Unified: 2: gpu, 3: system (mapped below)
 
+  ICHECK_LT(sem, 6) << "Invalid semantic value: " << sem;
+  ICHECK_LT(scope, 4) << "Invalid scope value: " << scope;
+
   // Build function name: tl::ptx_atom_add_<sem>_<scope>
   ss << "tl::ptx_atom_add_" << sem_str[sem] << "_" << scope_str[scope];
🤖 Prompt for AI Agents
In `@src/op/remote_copy.cc` around lines 422 - 431, The code constructs a function
name using sem_str[sem] and scope_str[scope] without validating indices, which
can lead to buffer overreads; add bounds checks for sem (<6) and scope (<4)
before indexing (similar to StOpNode::Lower and LdOpNode::Lower), and handle
out-of-range values by selecting a safe default string or emitting an error/log
and returning early so the use in building "tl::ptx_atom_add_<sem>_<scope>"
cannot index past sem_str or scope_str.

Comment on lines +327 to +328
int scope; ///< Memory scope (0: GPU, 1: SYS)
int sem; ///< Memory semantic (0: relaxed, 1: acquire, 2: release, 3: acq_rel)

⚠️ Potential issue | 🟡 Minor

Documentation does not match implementation.

The comments describe incorrect mappings:

  • scope: documented as (0: GPU, 1: SYS), but implementation uses {0: cta, 1: cluster, 2: gpu, 3: sys}.
  • sem: documented as (0: relaxed, 1: acquire, 2: release, 3: acq_rel), but implementation uses {0: weak, 1: volatile, 2: relaxed, 3: acquire, 4: release, 5: acq_rel}.

This inconsistency will mislead developers using this API.

📝 Suggested documentation fix
-  int scope;       ///< Memory scope (0: GPU, 1: SYS)
-  int sem; ///< Memory semantic (0: relaxed, 1: acquire, 2: release, 3: acq_rel)
+  int scope;  ///< Memory scope (0: CTA, 1: CLUSTER, 2: GPU, 3: SYS)
+  int sem;    ///< Memory semantic (0: WEAK, 1: VOLATILE, 2: RELAXED, 3: ACQUIRE, 4: RELEASE, 5: ACQ_REL)
🤖 Prompt for AI Agents
In `@src/op/remote_copy.h` around lines 327 - 328, Update the field comments for
the "scope" and "sem" members to match the actual enum/mapping used in the
implementation: describe scope as "{0: cta, 1: cluster, 2: gpu, 3: sys}" and
describe sem as "{0: weak, 1: volatile, 2: relaxed, 3: acquire, 4: release, 5:
acq_rel}" so the documentation for the int members scope and sem in
remote_copy.h matches the implemented semantics.

Comment on lines 135 to +147
WaitOp::WaitOp(Array<PrimExpr> args, BufferMap vmap) {
ObjectPtr<WaitOpNode> node = make_object<WaitOpNode>();
node->relation = args[0].as<IntImmNode>()->value;
ICHECK_GE(args.size(), 4);
const auto *relation_node = args[0].as<IntImmNode>();
ICHECK(relation_node) << "Wait relation must be an integer";
node->relation = relation_node->value;
node->addr = args[1];
node->expected = args[2];
node->peer = args[3];
// scope parameter is optional, default to SYSTEM (3) for safety
node->scope = (args.size() > 4) ? args[4].as<IntImmNode>()->value : 3;
// semantic parameter is optional, default to ACQUIRE (3) for safety
node->semantic = (args.size() > 5) ? args[5].as<IntImmNode>()->value : 3;

⚠️ Potential issue | 🟠 Major

Guard optional scope/semantic args before deref.

args[4] and args[5] are assumed IntImmNode and dereferenced directly, which will crash if a non-constant is passed. Add ICHECK (or accept non-const and validate later) to fail fast with a clear error.

✅ Suggested fix
-  node->scope = (args.size() > 4) ? args[4].as<IntImmNode>()->value : 3;
-  // semantic parameter is optional, default to ACQUIRE (3) for safety
-  node->semantic = (args.size() > 5) ? args[5].as<IntImmNode>()->value : 3;
+  if (args.size() > 4) {
+    const auto *scope_node = args[4].as<IntImmNode>();
+    ICHECK(scope_node) << "Wait scope must be an integer";
+    node->scope = scope_node->value;
+  } else {
+    node->scope = 3;
+  }
+  // semantic parameter is optional, default to ACQUIRE (3) for safety
+  if (args.size() > 5) {
+    const auto *semantic_node = args[5].as<IntImmNode>();
+    ICHECK(semantic_node) << "Wait semantic must be an integer";
+    node->semantic = semantic_node->value;
+  } else {
+    node->semantic = 3;
+  }

Comment on lines +733 to +747
def atom_add(barrier: PrimExpr,
value: PrimExpr,
scope: MemoryScope = MemoryScope.GPU,
sem: MemorySemantic = MemorySemantic.RELAXED):
"""Perform a ptx async copy barrier using cp.async.mbarrier.arrive.noinc.
"""
assert scope in ["gpu", "sys"], "Scope must be one of 'gpu', or 'sys'."
assert sem in ["relaxed", "acquire", "release", "acq_rel"
], "Semantic must be one of 'relaxed', 'acquire', 'release', or 'acq_rel'."
return tir.call_intrin("uint32", tir.op.Op.get("tl.atom_add"), address_of(barrier), value, sem,
scope)
scope_str = {MemoryScope.GPU: "gpu", MemoryScope.SYSTEM: "sys"}[scope]
sem_str = {
MemorySemantic.RELAXED: "relaxed",
MemorySemantic.ACQUIRE: "acquire",
MemorySemantic.RELEASE: "release",
MemorySemantic.ACQ_REL: "acq_rel"
}[sem]
return tir.call_intrin("uint32", tir.op.Op.get("tl.atom_add"), address_of(barrier), value,
sem_str, scope_str)

⚠️ Potential issue | 🟡 Minor

Missing error handling for unsupported scope/semantic values.

The dictionary lookups at lines 739-745 will raise KeyError if called with unsupported values (e.g., MemoryScope.CTA, MemoryScope.CLUSTER, or MemorySemantic.WEAK, MemorySemantic.VOLATILE, MemorySemantic.SC). Consider adding explicit validation with a helpful error message.

Proposed fix with validation
 def atom_add(barrier: PrimExpr,
              value: PrimExpr,
              scope: MemoryScope = MemoryScope.GPU,
              sem: MemorySemantic = MemorySemantic.RELAXED):
     """Perform a ptx async copy barrier using cp.async.mbarrier.arrive.noinc.
     """
-    scope_str = {MemoryScope.GPU: "gpu", MemoryScope.SYSTEM: "sys"}[scope]
-    sem_str = {
+    scope_map = {MemoryScope.GPU: "gpu", MemoryScope.SYSTEM: "sys"}
+    sem_map = {
         MemorySemantic.RELAXED: "relaxed",
         MemorySemantic.ACQUIRE: "acquire",
         MemorySemantic.RELEASE: "release",
         MemorySemantic.ACQ_REL: "acq_rel"
-    }[sem]
+    }
+    if scope not in scope_map:
+        raise ValueError(f"atom_add only supports GPU and SYSTEM scopes, got {scope}")
+    if sem not in sem_map:
+        raise ValueError(f"atom_add only supports RELAXED, ACQUIRE, RELEASE, ACQ_REL semantics, got {sem}")
+    scope_str = scope_map[scope]
+    sem_str = sem_map[sem]
     return tir.call_intrin("uint32", tir.op.Op.get("tl.atom_add"), address_of(barrier), value,
                            sem_str, scope_str)
🤖 Prompt for AI Agents
In `@tilelang/language/builtin.py` around lines 733 - 747, The atom_add function
currently uses direct dict lookups for scope_str and sem_str which will raise
KeyError for unsupported MemoryScope or MemorySemantic values; update atom_add
to explicitly validate inputs before using them (check the provided scope
against allowed values {MemoryScope.GPU, MemoryScope.SYSTEM} and sem against
{MemorySemantic.RELAXED, MemorySemantic.ACQUIRE, MemorySemantic.RELEASE,
MemorySemantic.ACQ_REL}), and if invalid raise a clear ValueError mentioning the
parameter name, the unsupported value, and the list of supported options; then
proceed to build scope_str/sem_str and call tir.call_intrin(address_of(barrier),
value, sem_str, scope_str) as before.
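The validated mapping can be sketched in plain Python; the enum names mirror the diff above, while `resolve_atom_add_strings` is a hypothetical helper standing in for the lookup inside `atom_add` (the `tir.call_intrin` call itself is omitted):

```python
from enum import Enum

class MemoryScope(Enum):
    CTA = 0
    CLUSTER = 1
    GPU = 2
    SYSTEM = 3

class MemorySemantic(Enum):
    WEAK = 0
    VOLATILE = 1
    RELAXED = 2
    ACQUIRE = 3
    RELEASE = 4
    ACQ_REL = 5

# Only the combinations atom_add actually supports are present.
_SCOPE_STR = {MemoryScope.GPU: "gpu", MemoryScope.SYSTEM: "sys"}
_SEM_STR = {
    MemorySemantic.RELAXED: "relaxed",
    MemorySemantic.ACQUIRE: "acquire",
    MemorySemantic.RELEASE: "release",
    MemorySemantic.ACQ_REL: "acq_rel",
}

def resolve_atom_add_strings(scope, sem):
    """Map enums to PTX strings, rejecting unsupported values up front."""
    if scope not in _SCOPE_STR:
        raise ValueError(
            f"atom_add only supports GPU and SYSTEM scopes, got {scope}")
    if sem not in _SEM_STR:
        raise ValueError(
            f"atom_add only supports RELAXED/ACQUIRE/RELEASE/ACQ_REL, got {sem}")
    return _SEM_STR[sem], _SCOPE_STR[scope]
```

Failing with a `ValueError` here gives the caller a clear message instead of a bare `KeyError` from the dict lookup.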

Comment on lines 737 to 738
"""Perform a ptx async copy barrier using cp.async.mbarrier.arrive.noinc.
"""

⚠️ Potential issue | 🟡 Minor

Incorrect docstring for atom_add.

The docstring describes "ptx async copy barrier using cp.async.mbarrier.arrive.noinc" which is incorrect—this function performs an atomic add operation. This appears to be copy-pasted from cp_async_barrier_noinc.

Proposed fix
 def atom_add(barrier: PrimExpr,
              value: PrimExpr,
              scope: MemoryScope = MemoryScope.GPU,
              sem: MemorySemantic = MemorySemantic.RELAXED):
-    """Perform a ptx async copy barrier using cp.async.mbarrier.arrive.noinc.
+    """Perform an atomic add operation on the given address.
+
+    Args:
+        barrier: The address to perform atomic add on.
+        value: The value to add.
+        scope: The memory scope (GPU or SYSTEM).
+        sem: The memory semantic (RELAXED, ACQUIRE, RELEASE, or ACQ_REL).
+
+    Returns:
+        tir.Call: The old value before the atomic add.
     """
🤖 Prompt for AI Agents
In `@tilelang/language/builtin.py` around lines 737-738: replace the
copy-pasted cp.async barrier docstring on atom_add with a concise description
stating that it performs an atomic add (adding a value to the target address
and returning the previous value, per the implementation), document the
supported data types and memory scopes/semantics (whether the operation is
relaxed or ordered, and which PTX atomic instruction it maps to), and ensure
the docstring matches the function's signature and behavior.

Comment on lines +839 to +854
is_remote = not (isinstance(dst_pe, tir.IntImm) and dst_pe.value == -1)

if is_remote:
# Compute remote address: remote_base_ptr(dst_pe) + (address_of(dst) - remote_base_ptr(get_rank()))
local_rank = tir.Call("int64", tir.op.Op.get("tl.get_rank"), [])
local_base_ptr = tir.Call("handle", tir.op.Op.get("tl.get_remote_base_ptr"), [local_rank])
offset_to_base = tir.Sub(
tir.Call("handle", tir.op.Op.get("tl.get_uintptr_t"), [address_of(dst)]),
local_base_ptr)
remote_ptr = tir.Add(
tir.Call("handle", tir.op.Op.get("tl.get_remote_base_ptr"), [dst_pe]), offset_to_base)
# Call the PTX intrinsic directly with remote address
return tir.call_extern("uint32", func_name, remote_ptr, value)
else:
# Local atomic add
return tir.call_extern("uint32", func_name, address_of(dst), value)

⚠️ Potential issue | 🔴 Critical

Bug: is_remote check doesn't handle Python int default value.

The check at line 839 only handles tir.IntImm:

is_remote = not (isinstance(dst_pe, tir.IntImm) and dst_pe.value == -1)

However, the default value -1 is a Python int, not a tir.IntImm. When called with the default dst_pe=-1, isinstance(dst_pe, tir.IntImm) is False, making is_remote = True, which incorrectly triggers the remote path for local operations.

Proposed fix
     # If dst_pe is specified and not -1, compute remote address
-    is_remote = not (isinstance(dst_pe, tir.IntImm) and dst_pe.value == -1)
+    is_local = (isinstance(dst_pe, int) and dst_pe == -1) or \
+               (isinstance(dst_pe, tir.IntImm) and dst_pe.value == -1)
+    is_remote = not is_local
 
     if is_remote:
🤖 Prompt for AI Agents
In `@tilelang/language/builtin.py` around lines 839 - 854, The is_remote check in
the function incorrectly treats a plain Python int dst_pe == -1 as remote
because it only checks isinstance(dst_pe, tir.IntImm); update the logic in the
is_remote computation (the variable named is_remote) to also consider a plain
int -1 as local—e.g., treat dst_pe == -1 the same as tir.IntImm with value == -1
or normalize dst_pe via getattr(dst_pe, "value", dst_pe) before comparison—so
the remote branch (the code that builds local_rank, local_base_ptr,
offset_to_base, remote_ptr and calls tir.call_extern with remote_ptr) is only
taken when dst_pe truly indicates a remote PE.
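The normalized check can be sketched in plain Python; `IntImm` here is a minimal stand-in for `tir.IntImm` (only its `.value` attribute matters), and `is_local_pe` is a hypothetical helper mirroring the proposed fix:

```python
class IntImm:
    """Stand-in for tir.IntImm: a wrapped integer constant with a .value field."""
    def __init__(self, value):
        self.value = value

def is_local_pe(dst_pe):
    """Treat both the plain Python default -1 and a wrapped IntImm(-1) as local.

    getattr(dst_pe, "value", dst_pe) unwraps an IntImm and leaves a plain
    int unchanged, so the comparison covers both call styles.
    """
    return getattr(dst_pe, "value", dst_pe) == -1
```

With this normalization, calling the op with the default `dst_pe=-1` no longer takes the remote path.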

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Around line 325-327: Remove the redundant f-prefix from f-strings that
contain no placeholders: change print(f"Benchmark Results:") to
print("Benchmark Results:"), and do the same for any other literal-only prints
in the block. Leave the prints that format variables (elapsed_time_ms,
bandwidth_gbps) as f-strings; update the print statements in the block
containing "Benchmark Results:", "  Average Latency:", and "  Effective
Bandwidth:" accordingly.
- Around line 233-236: Validate and guard block sizing before launching the
kernel: ensure block_M is > 0 and that M is evenly divisible by block_M (or
implement tail handling) so the kernel’s copy of block_M * N elements cannot go
out-of-bounds; specifically, check/adjust the variables M, block_M, block_N and
threads used in the setup (where block_M = M // 8, block_N = N, num_blocks_M = M
// block_M) to either compute num_blocks_M = ceil_div(M, block_M) and handle the
final partial block inside the kernel or error out with a clear message when
block_M == 0 or M % block_M != 0.
- Around line 27-28: The function torus_alltoall_xy currently declares a
parameter block_N which is never used; remove block_N from the function
signature and from all call-sites that pass it (ensure callers now pass only
PE_num, X, Y, M, N, block_M, threads), and update any related documentation or
tests that reference the old signature; search for usages of torus_alltoall_xy
to update invocations and adjust any higher-level wrappers or type hints
referencing block_N to keep the API consistent.

Comment on lines 233 to 236
M, N = args.M, args.N
block_M, block_N = M // 8, N
threads = 128
num_blocks_M = M // block_M

⚠️ Potential issue | 🟠 Major

Guard against invalid block sizing (OOB/div-by-zero).
The kernel always copies block_M * N elements, so when M % block_M != 0 or block_M == 0, you can hit OOB or divide-by-zero. Add validation or tail handling.

✅ Suggested validation
-    block_M, block_N = M // 8, N
+    block_M = max(1, M // 8)
+    if M % block_M != 0:
+        raise ValueError(
+            f"M must be divisible by block_M (M // 8). Got M={M}, block_M={block_M}"
+        )
     threads = 128
     num_blocks_M = M // block_M
🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 233
- 236, Validate and guard block sizing before launching the kernel: ensure
block_M is > 0 and that M is evenly divisible by block_M (or implement tail
handling) so the kernel’s copy of block_M * N elements cannot go out-of-bounds;
specifically, check/adjust the variables M, block_M, block_N and threads used in
the setup (where block_M = M // 8, block_N = N, num_blocks_M = M // block_M) to
either compute num_blocks_M = ceil_div(M, block_M) and handle the final partial
block inside the kernel or error out with a clear message when block_M == 0 or M
% block_M != 0.
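A host-side sketch of this validation, assuming `plan_blocks` as a hypothetical helper mirroring the example's `block_M = M // 8` computation:

```python
def plan_blocks(M, split=8):
    """Derive (block_M, num_blocks_M) from M and fail fast on bad shapes.

    Mirrors block_M = M // 8 from the example; raises instead of letting
    the kernel copy block_M * N elements past the end of the buffer when
    M is not evenly covered, or divide by zero when block_M == 0.
    """
    block_M = M // split
    if block_M == 0:
        raise ValueError(f"M={M} is too small for split={split}")
    if M % block_M != 0:
        raise ValueError(f"M={M} is not divisible by block_M={block_M}")
    return block_M, M // block_M
```

The alternative is tail handling inside the kernel (a ceil-div block count plus a bounds check on the final partial block), but erroring out keeps the example simple.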

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Around line 303-316: The benchmark loops reuse tensors so local_finish,
global_finish, and signal_transfer retain values across runs causing kernel (the
all-to-all transfer loop) to skip work; before each warmup and each timed
iteration (i.e., before calling kernel in both the 5 warmup iterations and
inside the num_iters loop) reset/zero-out those tensors (local_finish,
global_finish, signal_transfer) on the appropriate device so the kernel’s
while-loop condition behaves correctly; locate references to kernel(...,
buffer_transfer, signal_transfer, local_finish, global_finish, barrier) and
insert resets just prior to those calls.
🧹 Nitpick comments (2)
examples/distributed/intranode/example_alltoall_route2x4.py (2)

141-141: Remove debug print statement.

This T.print appears to be a leftover debug statement that will generate output during kernel execution, which may not be desired in production.

🧹 Proposed fix
-                T.print(bz, msg="block_idx")

199-212: Termination broadcast has O(PE_num³ × num_blocks_M) complexity.

The quad-nested loop broadcasts termination signals to all signal slots across all PEs. For larger configurations (e.g., 16 PEs with many blocks), this single-threaded broadcast from one block could become a bottleneck. Consider whether a more efficient termination mechanism (e.g., hierarchical signaling or collective barrier) would be beneficial for scalability.

Comment on lines 303 to 316
if args.benchmark:
# Warmup
for _ in range(5):
kernel(src, dst, buffer_transfer, signal_transfer, local_finish, global_finish, barrier)
torch.cuda.synchronize()
dist.barrier(group_size)

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

num_iters = 10
start_event.record()
for _ in range(num_iters):
kernel(src, dst, buffer_transfer, signal_transfer, local_finish, global_finish, barrier)

⚠️ Potential issue | 🔴 Critical

Critical: Signal and finish tensors are not reset between benchmark iterations.

After the first kernel execution, local_finish, global_finish, and signal_transfer will contain stale values from the previous run. The kernel's while-loop condition (global_finish[0] < PE_num) will be immediately false on subsequent iterations, causing the kernel to skip the transfer phase entirely.

Reset these tensors before each iteration or at least before the warmup/timed loops.

🐛 Proposed fix
     if args.benchmark:
         # Warmup
         for _ in range(5):
+            signal_transfer.fill_(PE_num + 1)
+            local_finish.fill_(0)
+            global_finish.fill_(0)
             kernel(src, dst, buffer_transfer, signal_transfer, local_finish, global_finish, barrier)
         torch.cuda.synchronize()
         dist.barrier(group_size)

         start_event = torch.cuda.Event(enable_timing=True)
         end_event = torch.cuda.Event(enable_timing=True)

         num_iters = 10
         start_event.record()
         for _ in range(num_iters):
+            signal_transfer.fill_(PE_num + 1)
+            local_finish.fill_(0)
+            global_finish.fill_(0)
             kernel(src, dst, buffer_transfer, signal_transfer, local_finish, global_finish, barrier)
         end_event.record()
🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 303
- 316, The benchmark loops reuse tensors so local_finish, global_finish, and
signal_transfer retain values across runs causing kernel (the all-to-all
transfer loop) to skip work; before each warmup and each timed iteration (i.e.,
before calling kernel in both the 5 warmup iterations and inside the num_iters
loop) reset/zero-out those tensors (local_finish, global_finish,
signal_transfer) on the appropriate device so the kernel’s while-loop condition
behaves correctly; locate references to kernel(..., buffer_transfer,
signal_transfer, local_finish, global_finish, barrier) and insert resets just
prior to those calls.
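The reset-before-each-launch pattern, sketched without torch; `kernel` and `make_state` are hypothetical stand-ins for the TileLang kernel and the tensor resets (the `fill_` calls) described above:

```python
import time

def run_benchmark(kernel, make_state, num_warmup=5, num_iters=10):
    """Benchmark a kernel that mutates its signal/finish state in place.

    The real kernel mutates signal_transfer/local_finish/global_finish;
    without a fresh reset, each subsequent launch re-enters with stale
    values and skips the transfer phase, so the measured time is bogus.
    Here make_state() plays the role of the fill_ resets.
    """
    for _ in range(num_warmup):
        kernel(make_state())  # fresh state per warmup launch
    start = time.perf_counter()
    for _ in range(num_iters):
        kernel(make_state())  # fresh state per timed launch
    return (time.perf_counter() - start) * 1e3 / num_iters  # ms per iteration
```

On GPU the reset must also land on the right device and be ordered before the launch, which `fill_` on the same stream already guarantees.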

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Around line 236-243: The host-side block computation can produce num_blocks ==
0 and trigger a ZeroDivisionError when computing tile_M; modify the calculation
of num_blocks (used before computing tile_M) to guard against zero by ensuring
num_blocks = max(1, min(num_tiles, 148 // (PE_num * PE_num))) (or explicitly
check if num_blocks == 0 and set it to 1) so the subsequent tile_M = M //
(num_blocks * num_warps) cannot divide by zero; update references to num_blocks,
num_tiles, num_warps and tile_M accordingly.
- Around line 27-35: In torus_alltoall_xy, remove the hardcoded num_SM=148 and
either query the device SM count (e.g., cuda.get_device().MULTIPROCESSOR_COUNT
or similar) or add num_SM as a function parameter; remove or use the unused
tile_N parameter (Ruff ARG001) — if block_N was intended, rename consistently;
compute num_blocks = max(1, min(num_SM // (PE_num * PE_num), num_tiles)) to
avoid ZeroDivisionError when PE_num is large, then compute block_M,
tiles_per_block and keep the existing tiles_per_block <= num_warps assertion
using num_warps = threads // 32. Ensure all references use the updated names
(num_SM, num_blocks, block_M, tiles_per_block, num_warps).
🧹 Nitpick comments (1)
examples/distributed/intranode/example_alltoall.py (1)

40-56: group_size is misleading — it's a ProcessGroup, not a size.

init_dist returns (rank, world_size, ProcessGroup). The name group_size suggests a numeric value, but it's actually a dist.ProcessGroup object passed to dist.barrier() and dist.all_to_all_single(). Consider renaming to group for clarity.

Also, after line 48, local_rank is actually the global rank (from dist.get_rank()). In a single-node setup these are identical, but the naming could confuse readers (and would silently break if extended to multi-node).

♻️ Suggested rename
-    local_rank, num_ranks, group_size = init_dist(local_rank, num_ranks)
+    rank, world_size, group = init_dist(local_rank, num_ranks)

And update all downstream references accordingly.

Comment on lines 27 to 35
def torus_alltoall_xy(PE_num, X, Y, M, N, tile_M, tile_N, threads):
num_SM = 148
num_tiles = M // tile_M
num_blocks = min(num_SM // (PE_num * PE_num), num_tiles)
num_warps = threads // 32
block_M = M // num_blocks
tiles_per_block = block_M // tile_M

assert tiles_per_block <= num_warps, "Each warp should handle the signal and transfer of one tile"

⚠️ Potential issue | 🟡 Minor

num_SM is hardcoded to 148 (H100-specific) and tile_N is unused; large PE_num causes ZeroDivisionError.

  1. num_SM = 148 is specific to H100/H200. On other architectures (e.g., A100 with 108 SMs), the occupancy calculation would be wrong.
  2. tile_N is never used inside the function (Ruff ARG001) — likely a leftover from the block_N rename.
  3. When PE_num >= 13, num_SM // (PE_num * PE_num) evaluates to 0, causing ZeroDivisionError at line 32 (block_M = M // num_blocks) and line 33.
🛡️ Suggested guard
     num_SM = 148
     num_tiles = M // tile_M
     num_blocks = min(num_SM // (PE_num * PE_num), num_tiles)
+    assert num_blocks > 0, f"PE_num={PE_num} is too large: num_SM // (PE_num^2) = 0"
     num_warps = threads // 32
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 27-27: Unused function argument: tile_N

(ARG001)

🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 27
- 35, In torus_alltoall_xy, remove the hardcoded num_SM=148 and either query the
device SM count (e.g., cuda.get_device().MULTIPROCESSOR_COUNT or similar) or add
num_SM as a function parameter; remove or use the unused tile_N parameter (Ruff
ARG001) — if block_N was intended, rename consistently; compute num_blocks =
max(1, min(num_SM // (PE_num * PE_num), num_tiles)) to avoid ZeroDivisionError
when PE_num is large, then compute block_M, tiles_per_block and keep the
existing tiles_per_block <= num_warps assertion using num_warps = threads // 32.
Ensure all references use the updated names (num_SM, num_blocks, block_M,
tiles_per_block, num_warps).
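A guarded version of the occupancy computation, as a plain-Python sketch; querying `num_SM` from the device (e.g. `torch.cuda.get_device_properties(0).multi_processor_count`) is left to the caller, and `compute_num_blocks` is a hypothetical helper:

```python
def compute_num_blocks(num_SM, PE_num, num_tiles):
    """Never return 0 blocks, even when PE_num**2 exceeds the SM count.

    num_SM should come from the device rather than a hardcoded 148
    (H100-specific; an A100 has 108 SMs). Clamping with max(1, ...)
    avoids the downstream ZeroDivisionError in block_M = M // num_blocks.
    """
    return max(1, min(num_SM // (PE_num * PE_num), num_tiles))
```

The same helper can replace the duplicated host-side computation so kernel and host stay consistent.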

Comment on lines 236 to 243
tile_M, tile_N = M // 16, N
threads = 256

num_tiles = M // tile_M
num_blocks = min(num_tiles, 148 // (PE_num * PE_num))
num_warps = threads // 32
# Modify tile_M to use all warps
tile_M = M // (num_blocks * num_warps)

⚠️ Potential issue | 🟡 Minor

Same ZeroDivisionError risk in host-side block computation.

The host duplicates the kernel's num_blocks calculation (line 240). For PE_num >= 13, num_blocks becomes 0, causing ZeroDivisionError at line 243 (tile_M = M // (num_blocks * num_warps)). Add a guard here as well.

🛡️ Suggested guard
     num_tiles = M // tile_M
     num_blocks = min(num_tiles, 148 // (PE_num * PE_num))
+    if num_blocks == 0:
+        raise ValueError(f"PE_num={PE_num} is too large for num_SM=148: cannot allocate any blocks")
     num_warps = threads // 32
🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 236
- 243, The host-side block computation can produce num_blocks == 0 and trigger a
ZeroDivisionError when computing tile_M; modify the calculation of num_blocks
(used before computing tile_M) to guard against zero by ensuring num_blocks =
max(1, min(num_tiles, 148 // (PE_num * PE_num))) (or explicitly check if
num_blocks == 0 and set it to 1) so the subsequent tile_M = M // (num_blocks *
num_warps) cannot divide by zero; update references to num_blocks, num_tiles,
num_warps and tile_M accordingly.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Around line 20-26: The decorator call to tilelang.jit currently passes a
hardcoded developer-specific debug_root_path
(debug_root_path="/home/zhengju.tang/...") which must be removed or made
portable; update the tilelang.jit invocation (the pass_configs and debug
options) to either omit the debug_root_path argument entirely or replace it with
a relative path or environment-driven value (e.g., read from an env var or use
"./debug") so the example contains no absolute user home directories.
- Around line 114-157: chunk_size can become negative when tile_M < PE_num;
before calling T.put_warp (both branches) compute and guard chunk_size so it is
non-negative—for example clamp available = T.max(0, tile_M - chunk_start) and
set chunk_size = T.min(chunk_M, available) (or skip the
T.put_warp/T.atom_add/T.sync_warp sequence when chunk_size <= 0). Update the
calculations around chunk_M, chunk_start, chunk_size and the put_warp calls to
use this non-negative-clamped value (references: chunk_M, chunk_start,
chunk_size, bx, tile_M, PE_num, and the T.put_warp/T.atom_add calls).

Comment on lines +20 to +26
@tilelang.jit(
pass_configs={
tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
},
debug_root_path="/home/zhengju.tang/tilescale/examples/distributed/debug/"
)

⚠️ Potential issue | 🟡 Minor

Remove the developer-specific hardcoded debug_root_path.

Line 25 contains an absolute path pointing to a developer's home directory (/home/zhengju.tang/...). This won't work on any other machine, leaks a username, and shouldn't ship in a public example. Either remove the argument entirely or use a relative path.

🔧 Proposed fix
 `@tilelang.jit`(
     pass_configs={
         tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
         tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
     },
-    debug_root_path="/home/zhengju.tang/tilescale/examples/distributed/debug/"
 )
🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 20
- 26, The decorator call to tilelang.jit currently passes a hardcoded
developer-specific debug_root_path (debug_root_path="/home/zhengju.tang/...")
which must be removed or made portable; update the tilelang.jit invocation (the
pass_configs and debug options) to either omit the debug_root_path argument
entirely or replace it with a relative path or environment-driven value (e.g.,
read from an env var or use "./debug") so the example contains no absolute user
home directories.

Comment on lines 114 to 157
# Phase 1: Fully use all blocks to initially send from src to the target neighbor
# Split the tile_M to each block

chunk_M = T.ceildiv(tile_M, PE_num)
chunk_start = bx * chunk_M
chunk_size = T.min(chunk_M, tile_M - chunk_start)

# if src_rank[0] == rank[0]:
if dst_rank[0] != rank[0]:
T.put_warp(
T.address_of(src[dst_rank[0] * M + bz * block_M + warp_idx * tile_M + chunk_start, 0]),
T.address_of(buffer_transfer[rank[0], dst_rank[0], bz * block_M + warp_idx * tile_M + chunk_start, 0]),
chunk_size * N,
next_rank[0],
)
if tx % 32 == 0:
T.atom_add_remote(
signal_transfer[rank[0], dst_rank[0], bz, warp_idx],
1,
scope=T.MemoryScope.SYSTEM,
sem=T.MemorySemantic.RELEASE,
dst_pe=next_rank[0],
)
T.sync_warp()
# T.fence_sys(sem=T.MemorySemantic.RELEASE)
else:
T.put_warp(
T.address_of(src[dst_rank[0] * M + bz * block_M + warp_idx * tile_M + chunk_start, 0]),
T.address_of(dst[rank[0] * M + bz * block_M + warp_idx * tile_M + chunk_start, 0]),
chunk_size * N,
-1,
)
if tx % 32 == 0:
old_local[0] = T.atom_add(
local_finish[0],
1,
scope=T.MemoryScope.GPU,
sem=T.MemorySemantic.RELEASE,
)
T.sync_warp()
# T.fence_cta(sem=T.MemorySemantic.RELEASE)

# T.fence_sys(sem=T.MemorySemantic.RELEASE)
T.sync_threads()

⚠️ Potential issue | 🟡 Minor

Phase 1: chunk_size can become negative for extreme parameter combinations.

When tile_M < PE_num, chunk_start = bx * chunk_M can exceed tile_M, making tile_M - chunk_start negative. T.min(chunk_M, negative_value) would then produce a negative chunk_size, which is passed as the size to put_warp. With the default parameters (M=8192, PE_num=8) this does not trigger, but a guard would prevent misuse.

🛡️ Suggested guard (inside the kernel, before the put_warp calls)
             chunk_M = T.ceildiv(tile_M, PE_num)
             chunk_start = bx * chunk_M
             chunk_size = T.min(chunk_M, tile_M - chunk_start)
 
-            if dst_rank[0] != rank[0]:
+            if chunk_start < tile_M and dst_rank[0] != rank[0]:

(Same guard for the else branch.)
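To make the failure mode concrete, here is a plain-Python sketch of the kernel's chunk arithmetic (illustrative only; `T.ceildiv` and `T.min` are modeled with `math.ceil` and `min`):

```python
import math

def chunk_bounds(tile_M, PE_num, bx):
    # Mirrors the kernel's arithmetic: T.ceildiv -> math.ceil, T.min -> min
    chunk_M = math.ceil(tile_M / PE_num)
    chunk_start = bx * chunk_M
    chunk_size = min(chunk_M, tile_M - chunk_start)  # can go negative
    return chunk_start, chunk_size

def chunk_size_clamped(tile_M, PE_num, bx):
    # Suggested fix: clamp the remaining extent to zero before taking the min
    chunk_M = math.ceil(tile_M / PE_num)
    chunk_start = bx * chunk_M
    return min(chunk_M, max(0, tile_M - chunk_start))

# Extreme case tile_M < PE_num: block bx=6 starts past the end of the tile
print(chunk_bounds(tile_M=4, PE_num=8, bx=6))        # (6, -2) -> negative size
print(chunk_size_clamped(tile_M=4, PE_num=8, bx=6))  # 0 -> transfer skipped

# Default parameters never trigger the issue
print(chunk_bounds(tile_M=8192, PE_num=8, bx=0))     # (0, 1024)
```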

🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 114
- 157, chunk_size can become negative when tile_M < PE_num; before calling
T.put_warp (both branches) compute and guard chunk_size so it is
non-negative—for example clamp available = T.max(0, tile_M - chunk_start) and
set chunk_size = T.min(chunk_M, available) (or skip the
T.put_warp/T.atom_add/T.sync_warp sequence when chunk_size <= 0). Update the
calculations around chunk_M, chunk_start, chunk_size and the put_warp calls to
use this non-negative-clamped value (references: chunk_M, chunk_start,
chunk_size, bx, tile_M, PE_num, and the T.put_warp/T.atom_add calls).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Around line 165-166: The condition uses a boolean comparison "flag[0] ==
False"; change it to use a truthiness check by replacing "flag[0] == False" with
"not flag[0]" in the if statement that checks signal_transfer, dst_rank and flag
(the line referencing signal_transfer[bx, dst_rank[0], bz, warp_idx] == PE_num
and flag[0] == False) so it conforms to PEP8 and handles non-boolean flag values
correctly.
- Line 245: The host-side assignment tile_M = M // (num_blocks * num_warps) is
dead code (unused) because the kernel computes its own tile_M; remove this
unused variable assignment (or instead pass it into the kernel if you intended
to use it) to fix the Ruff F841 warning—look for the symbol tile_M and the
related variables M, num_blocks, num_warps in example_alltoall_route2x4.py and
either delete the assignment or wire tile_M into the kernel invocation so it is
actually consumed.
🧹 Nitpick comments (1)
examples/distributed/intranode/example_alltoall_route2x4.py (1)

200-223: Termination detection: old_global[0] only reflects the last loop iteration's return value.

The loop at lines 201–208 calls atom_add_remote to each PE, but old_global[0] is overwritten each iteration and only retains the return from dst_pe = PE_num - 1. The subsequent check at line 209 (old_global[0] + 1 == PE_num) thus only gates on whether PE PE_num-1 has reached PE_num in its global_finish. This works as a distributed termination pattern (exactly one thread across all PEs will observe this transition), but the intent is subtle. A brief comment explaining why checking only the last PE suffices would help future readers.
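The pattern can be modeled in plain Python, using a hypothetical stand-in for `atom_add_remote` that, like a fetch-and-add, is assumed to return the destination counter's previous value:

```python
PE_num = 4
# Each PE's global_finish counter; suppose every PE has already received
# PE_num - 1 of the PE_num expected increments.
global_finish = [PE_num - 1] * PE_num

def atom_add_remote(dst_pe, val):
    # Hypothetical stand-in: fetch-and-add returning the previous value
    old = global_finish[dst_pe]
    global_finish[dst_pe] += val
    return old

old_global = [0]
for dst_pe in range(PE_num):
    old_global[0] = atom_add_remote(dst_pe, 1)  # overwritten every iteration

# Only the return from dst_pe == PE_num - 1 survives the loop, so this
# check effectively gates on the last PE's counter alone.
print(old_global[0] + 1 == PE_num)  # True
```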

Comment on lines +165 to +166
if signal_transfer[bx, dst_rank[0], bz, warp_idx] == PE_num and flag[0] == False:
flag[0] = True
⚠️ Potential issue | 🟡 Minor

Use not flag[0] instead of == False.

Comparing with == False is a PEP 8 / linting violation (E712) and can behave unexpectedly with non-boolean types.

-            if signal_transfer[bx, dst_rank[0], bz, warp_idx] == PE_num and flag[0] == False:
+            if signal_transfer[bx, dst_rank[0], bz, warp_idx] == PE_num and not flag[0]:
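A quick illustration of why the two checks differ for non-boolean values:

```python
flag = 0.0             # e.g. a float-typed flag element
print(flag == False)   # True: 0.0 compares equal to False
flag = None
print(flag == False)   # False: the == False check silently misses None
print(not flag)        # True: truthiness covers every falsy value
```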
🧰 Tools
🪛 Ruff (0.14.14)

[error] 165-165: Avoid equality comparisons to False; use not flag[0]: for false checks

Replace with not flag[0]

(E712)

🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 165
- 166, The condition uses a boolean comparison "flag[0] == False"; change it to
use a truthiness check by replacing "flag[0] == False" with "not flag[0]" in the
if statement that checks signal_transfer, dst_rank and flag (the line
referencing signal_transfer[bx, dst_rank[0], bz, warp_idx] == PE_num and flag[0]
== False) so it conforms to PEP8 and handles non-boolean flag values correctly.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Around line 276-278: The shape args passed to tilelang.tensor are plain
integers because (1) and (PE_num) are not tuples; change the single-element
shapes to proper 1-tuples when creating tensors in local_finish, global_finish
and barrier by passing (1,) and (PE_num,) to tilelang.tensor (keep allocator and
types the same).
🧹 Nitpick comments (1)
examples/distributed/intranode/example_alltoall_route2x4.py (1)

262-278: Large memory footprint — document hardware requirements.

buffer_transfer alone is (8, 8, 8192, 7168) × 2 bytes = 7.0 GiB per PE with the default parameters. Combined with src, dst, and the 32 GB allocator, this example requires high-memory GPUs (e.g., H100 80 GB). Consider adding a brief note in the script or README about minimum GPU memory requirements.
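The arithmetic, for reference (assuming 2-byte elements, i.e. fp16/bf16):

```python
PE_num, M, N = 8, 8192, 7168  # default parameters in the example
elem_bytes = 2                # fp16 / bf16

# buffer_transfer has shape (PE_num, PE_num, M, N)
buffer_transfer_bytes = PE_num * PE_num * M * N * elem_bytes
print(buffer_transfer_bytes / 2**30)  # 7.0 (GiB per PE)
```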

Comment on lines +276 to +278
local_finish = tilelang.tensor((1), torch.uint32, allocator=allocator).fill_(0)
global_finish = tilelang.tensor((1), torch.uint32, allocator=allocator).fill_(0)
barrier = tilelang.tensor((PE_num), torch.int32, allocator=allocator).zero_()
⚠️ Potential issue | 🟡 Minor

Shape arguments (1) and (PE_num) are plain integers, not tuples.

In Python, (1) evaluates to 1 and (PE_num) evaluates to PE_num. The tilelang.tensor function expects a shape tuple. Use trailing commas to create proper 1-element tuples:

-    local_finish = tilelang.tensor((1), torch.uint32, allocator=allocator).fill_(0)
-    global_finish = tilelang.tensor((1), torch.uint32, allocator=allocator).fill_(0)
-    barrier = tilelang.tensor((PE_num), torch.int32, allocator=allocator).zero_()
+    local_finish = tilelang.tensor((1,), torch.uint32, allocator=allocator).fill_(0)
+    global_finish = tilelang.tensor((1,), torch.uint32, allocator=allocator).fill_(0)
+    barrier = tilelang.tensor((PE_num,), torch.int32, allocator=allocator).zero_()
🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 276
- 278, The shape args passed to tilelang.tensor are plain integers because (1)
and (PE_num) are not tuples; change the single-element shapes to proper 1-tuples
when creating tensors in local_finish, global_finish and barrier by passing (1,)
and (PE_num,) to tilelang.tensor (keep allocator and types the same).
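A parenthesized expression without a trailing comma is not a tuple in Python; only the comma makes one:

```python
PE_num = 8
print(type((1)))      # <class 'int'>   -- just the integer 1
print(type((1,)))     # <class 'tuple'> -- a 1-element tuple
print((PE_num) == 8)  # True: plain int, parentheses are grouping only
print((PE_num,))      # (8,) -- a valid 1-element shape tuple
```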
